131 research outputs found

    Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease

    BACKGROUND: Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease. RESULTS: The filtering procedure for increased precision is based on the intuition that arguments which occur close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as when applied to the output of natural language processing. In an effort to exploit SemGen predications on the etiology of disease after increasing precision with postprocessing, a gene list was derived from extracted information enhanced with postprocessing filtering and was automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders. CONCLUSION: Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation
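
    The proximity filter described in this abstract can be illustrated with a minimal sketch. It is not SemGen's implementation: the `Predication` record, its distance fields, the gene names, and the gold-standard set are all hypothetical, and the sketch assumes that a distance of 1 means the argument sits in the phrase adjacent to the predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """Hypothetical record for one extracted relation (not SemGen's format)."""
    subject: str
    predicate: str
    obj: str
    subject_distance: int  # assumed: phrases from subject to predicate, 1 = adjacent
    object_distance: int   # assumed: phrases from object to predicate, 1 = adjacent


def filter_by_distance(predications, max_distance=1):
    """Keep predications whose two arguments both lie within max_distance phrases."""
    return [
        p for p in predications
        if p.subject_distance <= max_distance and p.object_distance <= max_distance
    ]


def precision(predications, gold):
    """Fraction of predications whose (subject, predicate, object) triple is in gold."""
    if not predications:
        return 0.0
    correct = sum(1 for p in predications if (p.subject, p.predicate, p.obj) in gold)
    return correct / len(predications)


# Made-up example data, for illustration only.
extracted = [
    Predication("GENE_A", "INTERACTS_WITH", "GENE_B", 1, 1),
    Predication("GENE_A", "INTERACTS_WITH", "GENE_C", 3, 1),
    Predication("GENE_D", "INTERACTS_WITH", "GENE_E", 1, 1),
    Predication("GENE_F", "INTERACTS_WITH", "GENE_G", 2, 4),
]
gold = {
    ("GENE_A", "INTERACTS_WITH", "GENE_B"),
    ("GENE_D", "INTERACTS_WITH", "GENE_E"),
}

baseline_precision = precision(extracted, gold)
filtered = filter_by_distance(extracted, max_distance=1)
filtered_precision = precision(filtered, gold)
```

    Raising `max_distance` keeps more relations at the cost of admitting more distant, and by the abstract's intuition less reliable, arguments.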

    Machine learning applied to the H index of Colombian authors with publications in Scopus

    Our research aims to establish how to predict the H index of Colombian authors with publications in Scopus until 2016. This date was selected because, as mentioned earlier, the number of documents indexed per year exceeded 10,000 and they obtained the highest number of documents cited. To accomplish this purpose, a quantitative, nonexperimental, cross-sectional, descriptive, explanatory, and predictive study was designed using supervised learning algorithms. These were applied to information from 8,840 Colombian authors. Among the findings we can highlight that: (i) Colombia is in fifth position among the countries of South America and the Caribbean in terms of the number of products and citations; (ii) the largest number of Colombian authors with products in Scopus until 2016 belonged mainly to the area of natural sciences, followed by medical sciences and health; (iii) most of the Colombian authors were men (64.2%, or 5,442), and men have higher H index values than women; (iv) using random cross validation for 10 iterations, the methods with the best predictive performance, measured by R2 and mean absolute error (MAE), were: AdaBoost (96.6% and 0.397, respectively); Random Forest (96.8% and 0.431, respectively); KNN (94.4% and 0.525, respectively); Tree (94.9% and 0.53, respectively); and Neural Network (93.3% and 0.7, respectively); and (v) the variables that help predict the H index of the Colombian authors, in addition to citations, are: the quantity of products, the number of products in Q1, and international collaboration
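
    For readers unfamiliar with the quantities named in this abstract, the sketch below shows the standard definitions of the H index, mean absolute error (MAE), and the coefficient of determination (R2). It is an illustration in plain Python, not the authors' pipeline; the function names and sample numbers are assumptions.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up numbers, for illustration only.
example_h = h_index([10, 8, 5, 4, 3])
observed = [3, 5, 2]
predicted = [2, 5, 4]
example_mae = mean_absolute_error(observed, predicted)
example_r2 = r2_score(observed, predicted)
```

    A lower MAE and an R2 closer to 1 both indicate predictions closer to the observed H index values, which is how the abstract ranks the five methods.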

    Clustering cliques for graph-based summarization of the biomedical research literature

    BACKGROUND: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). RESULTS: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. CONCLUSIONS: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). When compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65 respectively.
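
    The first two steps named in this abstract, filtering arguments by degree centrality and identifying cliques, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which clique algorithm or centrality threshold was used, so the basic Bron-Kerbosch enumeration, the `min_centrality` cutoff, and the toy graph are all hypothetical. The hierarchical clustering of cliques is left out.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets built from (argument, argument) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adj.items()}


def filter_nodes(adj, min_centrality):
    """Keep nodes at or above the (assumed) centrality cutoff and the edges among them."""
    centrality = degree_centrality(adj)
    keep = {node for node, value in centrality.items() if value >= min_centrality}
    return {node: adj[node] & keep for node in keep}


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph with made-up argument names, for illustration only.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
graph = build_graph(edges)
core = filter_nodes(graph, min_centrality=0.5)
found = set(maximal_cliques(core))
```

    In this toy graph the low-centrality node `e` is dropped before clique enumeration, which mirrors the idea of restricting cliques to highly connected arguments.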